Model comparison (done correctly) helps to choose the model that provides a good representation of the true DGP by penalizing models that overfit. This penalization is achieved mainly by assessing “fit” not on a training data set, but on a hold out test data set.
A complementary approach to work against overfitting is to specify priors that shrink model coefficients towards zero. Such shrinkage priors are typically normal distributions centered at zero with a relatively small standard deviation. Here, "relative" refers to the scale on which a predictor is measured.
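For example, whether a normal(0, 1) prior on a coefficient is "narrow" depends on the predictor's scale; standardizing predictors puts them on a common sd-1 scale so that prior widths become comparable. A minimal illustration with made-up height values:

```r
# A coefficient prior like normal(0, 1) is only "narrow" relative to the
# predictor's scale; standardizing makes scales comparable.
x_cm  = c(150, 160, 170, 180, 190)  # hypothetical heights in cm
x_std = as.vector(scale(x_cm))      # now mean 0, sd 1
round(c(mean = mean(x_std), sd = sd(x_std)), 2)
```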
To visualize how shrinkage works, we generate a small data set for which we want to estimate the mean:
set.seed(5)
y = scale(rnorm(15)) + 1  # 15 standardized values, shifted to mean 1
hist(y)
We want a Bayesian estimate of the mean of y, \(\small y \sim \mathrm{normal}(\mu, 1)\), under two different priors for \(\small \mu\): \(\small \mu \sim \mathrm{normal}(0,\sigma_{wide})\) and \(\small \mu \sim \mathrm{normal}(0,\sigma_{shrink})\):
par(mar = c(5, 3, .5, .1))
curve(dnorm(x, 0, 1), -10, 10, n = 500,
      ylab = "density", xlab = "mu", col = "blue", lwd = 3)
curve(dnorm(x, 0, 3), n = 500, add = TRUE,
      col = adjustcolor("blue", alpha = .5), lwd = 3)
legend("topleft",
       lty = 1, lwd = 3,
       col = c("blue", adjustcolor("blue", alpha = .5)),
       legend = c("mu ~ normal(0,1)", "mu ~ normal(0,3)"),
       bty = "n")
Here, we calculate the log-likelihood and the log prior density for a grid of candidate values of the mean of y:
mus = seq(-1, 3, .01)
P =
  apply(
    data.frame(mu = mus), 1,
    function(mu) {
      c(
        mu = mu,
        # log P(data | mu)
        llikelihood = sum(dnorm(y, mean = mu, log = TRUE)),
        # log P(mu), mu ~ normal(0,1)
        Psd1 = dnorm(mu, 0, sd = 1, log = TRUE),
        # log P(mu), mu ~ normal(0,3)
        Psd3 = dnorm(mu, 0, sd = 3, log = TRUE)
      )
    }
  )
P[,1:5]
## [,1] [,2] [,3] [,4] [,5]
## mu.mu -1.000000 -0.990000 -0.980000 -0.970000 -0.960000
## llikelihood -50.784078 -50.484828 -50.187078 -49.890828 -49.596078
## Psd1.mu -1.418939 -1.408989 -1.399139 -1.389389 -1.379739
## Psd3.mu -2.073106 -2.072001 -2.070906 -2.069823 -2.068751
Next, we add log-prior and log-likelihood and exponentiate to obtain the (unnormalized) posterior densities:
post1 = exp(P["llikelihood",] + P["Psd1.mu",])
post3 = exp(P["llikelihood",] + P["Psd3.mu",])
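These densities are unnormalized, but on our grid we can normalize them numerically and compute the posterior mean under each prior. This is a sketch assuming the grid step of .01 and the objects `mus`, `post1`, and `post3` from above; it shows the narrower prior pulling the estimate closer to zero:

```r
# Normalize the posteriors on the grid (step size .01) so each integrates
# to 1, then compute the posterior mean under each prior.
d  = .01
p1 = post1 / sum(post1 * d)
p3 = post3 / sum(post3 * d)
c(mean_sd1 = sum(mus * p1 * d),  # narrow prior: pulled closer to 0
  mean_sd3 = sum(mus * p3 * d))  # wide prior: closer to the sample mean
```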
Now we can show likelihood, prior and posterior together:
par(mfrow = c(3,1), mar = c(2.75,2.75,0,.25), mgp = c(1.25,.1,0), tck = -.01)
# Prior
curve(dnorm(x, 0, 1), min(mus), max(mus), n = 500,
      ylab = "P(mu)", xlab = "", col = "blue", lwd = 3)
curve(dnorm(x, 0, 3), n = 500, add = TRUE,
      col = adjustcolor("blue", alpha = .5), lwd = 3)
legend("topright",
       lty = 1, lwd = 3,
       col = c("blue", adjustcolor("blue", alpha = .5)),
       legend = c("mu ~ normal(0,1)", "mu ~ normal(0,3)"),
       bty = "n")
mtext("Prior", line = -2, adj = .05)
# Likelihood
plot(mus, exp(P["llikelihood",]), 'l',
     xlab = "", ylab = "P(y|mu)", col = "red", lwd = 2)
mtext("Likelihood", line = -2, adj = .05)
# Posterior (unnormalized)
plot(mus, post1, 'l',
     xlab = "mu", ylab = "P(y|mu) * P(mu)", col = "purple", lwd = 2)
lines(mus, post3, col = adjustcolor("purple", alpha = .5), lwd = 2)
abline(v = c(mus[which.max(post1)], mus[which.max(post3)]), lty = 3,
       col = c("purple", adjustcolor("purple", alpha = .5)))
legend("topright",
       lty = 1, lwd = 3,
       col = c("purple", adjustcolor("purple", alpha = .5)),
       legend = c("mu ~ normal(0,1)", "mu ~ normal(0,3)"),
       bty = "n")
mtext("Posterior", line = -2, adj = .05)
The important thing to keep in mind here is that the posterior is proportional to the product of prior and likelihood (or, on the log scale, the sum of log-prior and log-likelihood).
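Because both likelihood and prior are normal here, the posterior is also available in closed form: with data sd 1 and prior \(\small \mu \sim \mathrm{normal}(0, \sigma_0)\), the posterior mean (and mode) is \(\small n\bar{y}/(n + 1/\sigma_0^2)\). As a sketch (assuming `y`, `mus`, `post1`, and `post3` from above), we can check the grid maxima against this formula:

```r
# Closed-form posterior mode for y ~ normal(mu, 1), mu ~ normal(0, s0):
# mode = n * mean(y) / (n + 1 / s0^2)
n = length(y); ybar = mean(y)
mode_sd1 = n * ybar / (n + 1 / 1^2)
mode_sd3 = n * ybar / (n + 1 / 3^2)
round(c(grid_sd1 = mus[which.max(post1)], closed_sd1 = mode_sd1,
        grid_sd3 = mus[which.max(post3)], closed_sd3 = mode_sd3), 3)
```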
We zoom in to see more clearly how the narrow prior shrinks the estimated mu towards zero.
par(mar = c(5, 3, .5, .1))
ids = which(mus > 0 & mus < 1.5)
plot(mus[ids], post1[ids], 'l',
     xlab = "mu", ylab = "P(y|mu) * P(mu)", col = "purple", lwd = 2)
lines(mus[ids], post3[ids], col = adjustcolor("purple", alpha = .5), lwd = 2)
abline(v = c(mus[which.max(post1)], mus[which.max(post3)]), lty = 3,
       col = c("purple", adjustcolor("purple", alpha = .5)))
arrows(x0 = mus[which.max(post3)],
       x1 = mus[which.max(post1)],
       y0 = mean(c(max(post1), max(post3))),
       lwd = 3, length = .15)
legend("topright",
       lty = 1, lwd = 3,
       col = c("purple", adjustcolor("purple", alpha = .5)),
       legend = c("mu ~ normal(0,1)", "mu ~ normal(0,3)"),
       bty = "n")
To show that shrinkage works, we estimate spline models with different prior standard deviations on the regression coefficients. First, we re-simulate the income / well-being data from above:
N = 1000
set.seed(123456)
x = sort(rnorm(N, mean = 0))
bf.s = splines::bs(x, df = 3)        # B-spline basis with 3 basis functions
bf.b = matrix(c(1, 1, 1), ncol = 1)  # true basis coefficients
mu = bf.s %*% bf.b                   # true mean function
y = rnorm(N, mu, sd = .1)
par(mfrow = c(1,1))
plot(x, y)
lines(x, mu, col = "red", lwd = 2)
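As a side note, for this linear-in-the-basis model the connection between prior sd and shrinkage can be made explicit: under coefficient priors \(\small b \sim \mathrm{normal}(0, sd_b)\) and residual sd \(\small \sigma\), the MAP estimate equals ridge-penalized least squares with penalty \(\small \lambda = \sigma^2 / sd_b^2\). A minimal sketch on the simulated data above (`ridge_fit` is a hypothetical helper, not from the text):

```r
# Under b ~ normal(0, sd_b) with residual sd sigma, the MAP estimate of the
# basis coefficients is ridge regression with lambda = sigma^2 / sd_b^2.
ridge_fit = function(X, y, lambda) {
  solve(crossprod(X) + lambda * diag(ncol(X)), crossprod(X, y))
}
b_wide   = ridge_fit(bf.s, y, lambda = .1^2 / 10^2)  # sd_b = 10: barely shrunk
b_narrow = ridge_fit(bf.s, y, lambda = .1^2 / .1^2)  # sd_b = .1: shrunk towards 0
cbind(true = bf.b, wide = b_wide, narrow = b_narrow)
```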
The following figure shows the estimated relationships for different samples drawn from the population.